Add pure julia exp function #19831

Merged
merged 2 commits into JuliaLang:master from the expfun branch on Jan 9, 2017

Conversation

@musm (Contributor) commented Jan 3, 2017

I still need to add benchmarks, but this is as fast as or faster than the openlibm version: from 1.0× (same speed) to roughly 1.15×, 1.3×, or even 1.5× faster, depending on which branch you test (measured with BenchmarkTools). On most branches it's faster; I think there is only one branch where it's basically the same speed.

Accuracy is basically the same (both are based on the msun version).

function exp{T<:Union{Float32,Float64}}(x::T)
xu = reinterpret(Unsigned, x)
xs = xu & ~sign_mask(T)
xsb = Int(xu >> Unsigned(sizeof(T)*8 - 1))
@musm (author), Jan 3, 2017:

Probably just as efficient to use signbit, maybe I should use it instead

Member:

ya, would be clearer
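
For illustration, a minimal sketch (not the PR's final code) showing that the manual bit extraction above and signbit carry the same information:

x   = -2.5
xu  = reinterpret(Unsigned, x)
xsb = Int(xu >> (sizeof(x)*8 - 1))   # 1 for negative x, 0 otherwise, as in the diff
sb  = signbit(x)                     # same information as a Bool; arguably clearer
@assert (xsb == 1) == sb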


using Base: sign_mask, exponent_mask, exponent_one, exponent_bias,
@musm (author):

No need to import any of these since we don't extend them

@musm (author) commented Jan 3, 2017

The i686 error is the usual failure and is unrelated.

@@ -70,11 +71,23 @@ end

# evaluate p[1] + x * (p[2] + x * (....)), i.e. a polynomial via Horner's rule
macro horner(x, p...)
@gensym val
Member:

Why is this necessary? Won't the macro hygiene automatically convert t into a unique name?

@musm (author):

Yes, good point. I think this was an issue in an earlier version of Julia, but 0.6 converts it into a unique name automatically.
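
For reference, a minimal sketch of a hygienic Horner macro that relies on macro hygiene instead of @gensym (illustrative, in the spirit of Base's @horner rather than necessarily its exact code):

macro horner(x, p...)
    ex = esc(p[end])
    for i = length(p)-1:-1:1
        ex = :(muladd(t, $ex, $(esc(p[i]))))   # t*ex + p[i]
    end
    # `t` is introduced by the macro itself, so hygiene renames it to a unique symbol
    Expr(:block, :(t = $(esc(x))), ex)
end

# usage: evaluates 1 + 2x + 3x^2 at x = 0.5, giving 2.75
@horner(0.5, 1, 2, 3)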

MINEXP(::Type{Float32}) = -103.97207708f0 # log 2^-150

# coefficients from: lib/msun/src/e_exp.c
@inline exp_kernel{T<:Float64}(x::T) = @horner_oftype(x, 1.66666666666666019037e-1,
Sponsor Member:

x::Float64 ?


# evaluate p[1] + x * (p[2] + x * (....)), i.e. a polynomial via Horner's rule
# and convert coefficients to same type as x
macro horner_oftype(x, p...)
Member:

Why is this macro needed? If you change the floating-point type, you generally will change the degree of the polynomial as well, and you can just use literals of the correct type, no? e.g. it looks like you only use it for Float64 and Float32 below, and Julia has a compact syntax for Float32 literals (append f0).

@musm (author):

I agree that changing the floating-point type will generally change the degree of the polynomial. That said, one case where it doesn't is native Float16 support (useless on CPUs, admittedly): dispatching to the same polynomial as ::Float32 works well in practice, since across many functions I've found it really hard to find good coefficients by reducing the degree of the Float32 polynomial for Float16. For example, this function works just as well, within accuracy, for Float16. So the macro is helpful for handling such dispatch cases; see the sketch below.
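
A sketch of how such a macro could convert each coefficient to the type of x (illustrative; not necessarily the PR's exact implementation):

macro horner_oftype(x, p...)
    ex = :(oftype(t, $(esc(p[end]))))
    for i = length(p)-1:-1:1
        ex = :(muladd(t, $ex, oftype(t, $(esc(p[i])))))
    end
    Expr(:block, :(t = $(esc(x))), ex)
end

# the same Float64 coefficient literals, evaluated in Float32 (or Float16) arithmetic:
@horner_oftype(1.5f0, 0.5, 0.25)   # == 0.875f0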

@stevengj (Member), Jan 3, 2017:

I would rather not add the extra macro for an odd case (it is unusual that you wouldn't be able to reduce the degree for Float16) that we don't really implement (probably on CPUs we will just compute exp(x::Float16) by converting to/from Float32, no?).

@musm (author):

I would rather not add the extra macro for an odd case (it is unusual that you wouldn't be able to reduce the degree for Float16) that we don't really implement

Not really unusual for Float16 (as opposed to going from Float128 to Float64): you have a lot less precision, so it's very hard to find an even smaller-degree polynomial that retains the same accuracy constraint (< 1 ulp). This is the hardest and most time-consuming part: testing accuracy and tuning the polynomial coefficients.

(probably on CPUs we will just compute exp(x::Float16) by converting to/from Float32, no?)

yes

I will remove it anyway.
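
For completeness, the conversion approach mentioned above would look roughly like this (a sketch with a hypothetical helper name, not part of this PR):

# compute exp for Float16 by promoting to Float32 and rounding the result back down
exp_via_float32(x::Float16) = Float16(exp(Float32(x)))

exp_via_float32(Float16(1.0))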

MINEXP(::Type{Float32}) = -103.97207708f0 # log 2^-150

# coefficients from: lib/msun/src/e_exp.c
@inline exp_kernel{T<:Float64}(x::T) = @horner_oftype(x, 1.66666666666666019037e-1,
Member:

Why not just exp_kernel(x::Float64)?

@musm (author), Jan 3, 2017:

Originally my motivation was to have two or more polynomials depending on the float type and use dispatch as follows:
- A low-degree polynomial for Float16 and Float32
- A higher-degree polynomial for Float64 and other floats like Float128

so in: https://github.com/JuliaMath/Amal.jl/blob/master/src/exp.jl

I have the two polynomial kernels dispatch on the following types:

SmallFloat = Union{Float16,Float32}
LargeFloat = Float64   # perhaps Union{Float64,Float128} in the future

That grand goal might be a pipe dream, so if you all prefer I can restrict the types instead of using this and the @horner_oftype machinery (although there is some elegance to the current setup, with @horner_oftype and the polynomial dispatch on the type).

@stevengj (Member), Jan 3, 2017:

I would prefer eliminating the extra macro for now.

(Wouldn't the Float16 computations be faster in Float32 arithmetic anyway? And wouldn't we need a higher degree for Float128 than for Float64?)

@musm (author):

Yep, it would be more efficient on the CPU. The motivation was GPU code, where you do have native Float16 hardware and there is a real benefit (there is some very nice work under way on better Float16 support). For Float128 I have not tested yet, but I suspect so.

I guess I can revisit this when the time comes; for now I'll restrict the polynomial dispatch and drop the additional macro.

# Float64, which is well within the allowable error. however,
# 2.718281828459045 is closer to the true value so we prefer that answer,
# given that 1.0 is such an important argument value.
if x == T(1.0) && T == Float64
Member:

Maybe x == T(1.0) && return T(2.718281828459045235360) so that we get the exactly rounded result for Float32 too?

@musm (author):

Float32 is already exactly rounded without this, so it's better to keep it this way and avoid the extra check in the Float32 version; see the quick check below.
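
A quick way to verify the claim (a sketch; the comparison against BigFloat is mine, not from the PR):

# exp(1f0) should already equal the correctly rounded Float32 value of e,
# so only the Float64 method needs the special case
exp(1f0) == Float32(exp(big(1.0)))   # expected to be true per the discussion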

Member:

Okay.

@musm (author):

What's the difference between T == Float64 and T === Float64? I'm leaning towards using the latter, but for no particular reason.

end
else
n = round(T(LOG2E)*x)
k = unsafe_trunc(Int,n)
Member:

It seems bad that this will behave differently on 32-bit and 64-bit machines?

@musm (author):

No, it doesn't matter in this case: k is the integer exponent of the float, and the floating-point value overflows long before k could exceed the Int32 range, so there is no way it can cause issues. We could just as well use Int32, but I thought it might be nicer to use the machine-size Int?

@stevengj (Member), Jan 3, 2017:

Better to use Int32, I think. (We could even use Int16, it sounds like, but there's probably no advantage to that.)

@musm (author):

I made the change, but I don't fully understand why we should prefer Int32 over Int. I tried both and it's not like there is any critical difference in the resulting code.
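
For reference, a sketch (not the PR's exact code) of the reduction step with k held as an Int32; the constant values quoted here are the standard fdlibm/msun ones, for illustration only:

const LOG2E  = 1.44269504088896338700e+0   # 1/log(2)
const LN2_HI = 6.93147180369123816490e-1   # high part of log(2)
const LN2_LO = 1.90821492927058770002e-10  # low part of log(2)

function exp_reduce(x::Float64)
    n  = round(LOG2E*x)
    k  = unsafe_trunc(Int32, n)   # |k| stays small (< ~1100) before exp overflows,
                                  # so Int32 is more than enough on any platform
    hi = x - n*LN2_HI             # exact: LN2_HI has trailing zero bits
    lo = n*LN2_LO
    r  = hi - lo                  # reduced argument, |r| <= log(2)/2
    return k, r
end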

@stevengj (Member) commented Jan 3, 2017

Would be good to have test cases that exercise each of the branches, including test cases just above and just below the various thresholds like T(0.5)*T(LN2) (since that's usually where the error is maximized).
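
A sketch of the kind of boundary tests being asked for; the exact values and tolerance here are illustrative, not the final test set:

using Base.Test   # `using Test` on Julia 0.7 and later

for T in (Float32, Float64)
    thresh = T(0.5)*T(log(2))                 # branch threshold mentioned above
    for x in (prevfloat(thresh), thresh, nextfloat(thresh)), s in (x, -x)
        err = abs(exp(s) - T(exp(big(s))))
        @test err <= eps(T(exp(big(s))))      # stay within ~1 ulp of the true value
    end
end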

@stevengj added the "maths: Mathematical functions" label on Jan 3, 2017
@stevengj (Member) commented Jan 3, 2017

Out of curiosity, why is it so much faster than the openlibm version? More inlining?

@musm (author) commented Jan 3, 2017

Would be good to have test cases that exercise each of the branches, including test cases just above and just below the various thresholds like T(0.5)*T(LN2) (since that's usually where the error is maximized).

I have carefully tested this function in https://github.com/JuliaMath/Amal.jl/tree/master/test over the ranges -10:0.0002:10, -1000:0.001:1000, -120:0.0023:1000, and -1000:0.02:2000 (and denser cases locally; see https://github.com/JuliaMath/Amal.jl/blob/master/test/accuracy.jl), and have compared it with the Base version.

@musm (author) commented Jan 3, 2017

Out of curiosity, why is it so much faster than the openlibm version? More inlining?

Hard for me to tell without looking at something like code_llvm output, or at the openlibm version itself. It might be a combination of more inlining and how some of the constants are handled (openlibm stores some of them in an array). Is there any overhead, even a minor one, to calling C code? These functions run in 6–15 ns, so even a 1 ns overhead could have an impact.
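
A sketch of the kind of BenchmarkTools timing involved; the inputs are chosen only to hit different branches and are not an official benchmark set:

using BenchmarkTools

for x in (2.0^-30, 0.5*log(2), 3.0, 700.0)
    t = @benchmark exp($x)
    println(x, " => ", minimum(t))
end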

@stevengj (Member) commented Jan 3, 2017

@musm, I don't doubt that you have tested your code, but we should have the key tests in our test suite, at the very least to make sure that we don't get regressions in the future.

@musm (author) commented Jan 3, 2017

Addressed some of the review comments.

I added a function to float.jl that gives the unsigned integer type corresponding to a float type:

fpintsize(::Type{Float64}) = UInt64
fpintsize(::Type{Float32}) = UInt32

This would make this function and one of the other float functions cleaner:

reinterpret(T, rem(Int32(32), fpintsize(T)))
# vs. what we use now
xu = reinterpret(Unsigned, x)
reinterpret(T, rem(Int32(32), typeof(xu)))

I still need to add tests and benchmarks.

# integer size of float
fpintsize(::Type{Float64}) = UInt64
fpintsize(::Type{Float32}) = UInt32
fpintsize(::Type{Float16}) = UInt16
Member:

fpintsize sounds like it should return a number (64 or 8)... maybe fpint or fpinttype?

@musm (author):

fpinttype sounds good to me

@pure exponent_raw_max{T<:AbstractFloat}(::Type{T}) = Int(exponent_mask(T) >> significand_bits(T))



@musm (author):

need to get rid of all these additional white spaces :)

-1.65339022054652515390e-6, 4.13813679705723846039e-8)

# coefficients from: lib/msun/src/e_expf.c
@inline exp_kernel(x::Float32) = @horner(x, 1.6666625440f-1, -2.7667332906f-3)
Member:

(This is such a low degree that I wonder if we'd be better off using a minimax polynomial rather than a minimax rational function, at least for the single-precision case — the optimal polynomial would require a higher degree, but it would avoid the division. I don't think we need to address that possibility in this PR, however, since this is no worse than what we are doing now.)

@musm (author), Jan 3, 2017:

This is a good question. I have already played with many variations, and my conclusion is that the msun version is the best of everything I have tested, considering both accuracy (the critical goal of being < 1 ulp) and performance.

Here's the problem with the minimax polynomial approach for the exp function:

On my Float32 test suite, over the range vcat(-10:0.0002:10, -1000:0.001:1000, -120:0.0023:1000, -1000:0.02:2000):

FMA system, rational:        max 0.86076152 ulp at x = -5.85780, mean 0.03712445
non-FMA system, rational:    max 0.86076152 ulp at x = -5.85780, mean 0.03712434
FMA system, polynomial:      max 0.84459400 ulp at x = -48.8587, mean 0.03627487
non-FMA system, polynomial:  max 1.09986448 ulp at x = 5.23060,  mean 0.03798720
(errors may be larger, since we haven't tested all floats)

As you can see, it's not possible to stay under 1 ulp on non-FMA systems using this polynomial (below).

What about speed? I tested 3 hot branches, and the polynomial version on an FMA system is ~0.5–0.8 ns slower.

For reference, this is the best minimax polynomial I could find for Float32:

@inline exp_kernel{T<:SmallFloat}(x::T) = @horner_oftype(x, 1.0, 1.0, 0.5,
    0.1666666567325592041015625,
    4.1666455566883087158203125e-2,
    8.333526551723480224609375e-3,
    1.39357591979205608367919921875e-3,
    1.97799992747604846954345703125e-4)

Note that it's not as simple as raising the degree of this polynomial to improve the accuracy.
This is the lowest-degree minimax polynomial I was able to find that meets the < 1 ulp constraint.

For Float64

@inline exp_kernel{T<:LargeFloat}(x::T) = @horner_oftype(x, 1.0, 1.0, 0.5,
    0.16666666666666685170383743752609007060527801513672,
    4.1666666666666692109277647659837384708225727081299e-2,
    8.3333333333159547579027659480743750464171171188354e-3,
    1.38888888888693412537733706813014578074216842651367e-3,
    1.9841269898657093212653024227876130680670030415058e-4,
    2.4801587357008890921336585755341275216778740286827e-5,
    2.7557232875898009206386968239499424271343741565943e-6,
    2.7557245320026768203034231441428403286408865824342e-7,
    2.51126540120060271373185023340013355408473216812126e-8,
    2.0923712382298872819985862227861600493028504388349e-9)

I haven't tested this in a while, but the same conclusion holds.

@inline exp_kernel(x::Float32) = @horner(x, 1.6666625440f-1, -2.7667332906f-3)

# custom coefficients (slightly better mean accuracy, but also a bit slower)
# @inline exp_kernel(x::Float32) = @horner_oftype(x, 0.1666666567325592041015625f0,
@musm (author), Jan 3, 2017:

I need to edit this to @horner or delete the comment entirely

@@ -0,0 +1,137 @@
# Based on FreeBSD lib/msun/src/e_exp.c
# ====================================================
# Copyright (C) 2004 by Sun Microsystems, Inc. All rights reserved. Permission
Contributor:

need to add exceptions to the license file and contrib/add_license_to_files.jl

@musm (author):

I don't really know what you mean. What do I need to do?

I looked here, for example:
https://github.com/JuliaLang/openlibm/blob/master/LICENSE.md

This part seems relevant:

OpenLibm

OpenLibm contains code that is covered by various licenses.

The OpenLibm code derives from the FreeBSD msun and OpenBSD libm implementations, which in turn derives from FDLIBM 5.3. As a result, it has a number of fixes and updates that have accumulated over the years in msun, and also optimized assembly versions of many functions. These improvements are provided under the BSD and ISC licenses. The msun library also includes work placed under the public domain, which is noted in the individual files. Further work on making a standalone OpenLibm library from msun, as part of the Julia project is covered under the MIT license. The test files, test-double.c and test-float.c are under the LGPL.

@musm (author):

Do I just need to add a link to openlibm's LICENSE.md (it's already there) and say it was used as a reference, or something?

Contributor:

The Julia license file needs to have an exception here, and the script that adds license headers needs an exception for this file. It's derived code, so it should usually retain its original licensing.

@musm (author):

Is there an example I can copy to do this?

@musm (author):

So I should add this line

- base/special/exp.jl (see [FREEBSD MSUN](https://github.com/freebsd/freebsd) [FreeBSD/2-clause BSD/Simplified BSD License])

below "The following components of Julia's standard library have separate licenses:", even if it's derived code?

@musm (author):

# ====================================================
# Copyright (C) 2004 by Sun Microsystems, Inc. All rights reserved. Permission
# to use, copy, modify, and distribute this software is freely granted,
# provided that this notice is preserved.
# ====================================================

Do I even need to include this? We are distributing the original file.

Is it necessary to add a line here https://github.com/JuliaLang/julia/blob/master/contrib/add_license_to_files.jl#L39 to skip this?
I'm not sure; parts of this file include ported and derived code.

Contributor:

It's copyright Sun Microsystems, with permission to modify and distribute granted provided this notice is preserved. So it's not "MIT license, copyright JuliaLang contributors" like the majority of the rest of Base, and we're not distributing the original file in this repo under our MIT license.

@musm (author):

OK, I added a commit to address this: 46058d3

Contributor:

You only did half of it.

Is it necessary to add a line here https://github.com/JuliaLang/julia/blob/master/contrib/add_license_to_files.jl#L39 to skip this?

Yes. "exceptions to the license file and contrib/add_license_to_files.jl"

@KristofferC (Sponsor Member):
It is very cool that it is possible to do this in julia.

@musm (author) commented Jan 4, 2017

Running Nanosoldier after JuliaCI/BaseBenchmarks.jl#58 should be an interesting test, since quite a few other files in BaseBenchmarks use the exp function.

@musm (author) commented Jan 5, 2017

Ok so the benchmarks have been merged.

@musm (author) commented Jan 7, 2017

Should we trigger nanosoldier now?

@vchuravy (Sponsor Member) commented Jan 7, 2017

@nanosoldier runbenchmarks(ALL, vs=":master")

Also this should be squashed before committing it to master.

@nanosoldier (Collaborator):

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@stevengj (Member) commented Jan 7, 2017

Performance looks good; the regressions are probably noise. But I don't think nanosoldier was running the new exp benchmarks?

@musm (author) commented Jan 8, 2017

Can we retrigger the benchmarks? They were just uploaded to nanosoldier. Thanks

@Sacha0 (Member) commented Jan 8, 2017

@nanosoldier runbenchmarks(ALL, vs=":master")

# compute approximation
z = r*r
p = r - z*exp_kernel(z)
y = T(1.0) - ((lo - (r*p)/(T(2.0) - p)) - hi) # compute polynomial on reduced arugment
Contributor:

argument

@nanosoldier (Collaborator):

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@musm force-pushed the expfun branch 3 times, most recently from bd4945c to 7323b8f on January 8, 2017 05:57
@musm (author) commented Jan 8, 2017

Squashed and good to go on my end. The regressions don't make sense to me, since those files don't even use exp. I'm also not sure why we don't see more improvements, given that I have locally benchmarked other branches of exp, but I think this is just some resolution/noise issue on nanosoldier.

@musm (author) commented Jan 9, 2017

bump

@stevengj merged commit f864b22 into JuliaLang:master on Jan 9, 2017
@musm deleted the expfun branch on January 9, 2017 22:14
vchuravy pushed a commit to JuliaPackaging/LazyArtifacts.jl that referenced this pull request Oct 2, 2023
* Add pure julia exp function

* Add a few more ldexp tests
Labels: maths (Mathematical functions)
8 participants